**Summary of the Document:**

The paper introduces **DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization)**, an open-source reinforcement learning (RL) system designed to enhance large language models' (LLMs) reasoning capabilities. Key contributions include:

1. **DAPO Algorithm**:
   - Improves RL training for LLMs by addressing issues such as entropy collapse, reward noise, and training instability.
   - Outperforms the previous state-of-the-art (e.g., DeepSeek-R1) with **50% fewer training steps**, achieving **50 points on AIME 2024** using the Qwen2.5-32B base model.

2. **Key Techniques** (illustrative sketches of these appear after this summary):
   - **Clip-Higher**: Decouples the lower and upper clipping ranges to promote diversity and avoid entropy collapse.
   - **Dynamic Sampling**: Filters out zero-gradient prompts to stabilize training.
   - **Token-Level Policy Gradient Loss**: Balances the contributions of long and short responses for better reasoning.
   - **Overlong Reward Shaping**: Reduces reward noise by softly penalizing truncated samples.

3. **Open-Source Release**:
   - Includes the training code (built on the *verl* framework) and the **DAPO-Math-17K dataset**, curated from math competition problems.

4. **Results**:
   - Achieves **50 points on AIME 2024**, surpassing DeepSeek-R1's 47 points.
   - Demonstrates emergent reasoning behaviors (e.g., self-reflection) during RL training.

5. **Impact**:
   - Democratizes access to scalable RL for LLMs by revealing previously undisclosed technical details.

**Conclusion**: DAPO advances LLM reasoning through innovative RL techniques and full transparency, enabling reproducibility and future research.
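
As a rough illustration of how Clip-Higher and the token-level policy gradient loss fit together, the sketch below computes a decoupled-clip surrogate loss averaged over all tokens in a batch rather than per response. The function name, tensor shapes, and the specific `eps_low`/`eps_high` values are illustrative assumptions, not the paper's reference implementation.

```python
import numpy as np

def dapo_style_token_loss(logp_new, logp_old, advantages, mask,
                          eps_low=0.2, eps_high=0.28):
    """Illustrative decoupled-clip ("Clip-Higher") surrogate loss.

    logp_new, logp_old: per-token log-probs, shape (batch, seq_len)
    advantages:         per-token advantages, same shape
    mask:               1 for real response tokens, 0 for padding
    eps_low/eps_high:   asymmetric clip range (values are placeholders);
                        eps_high > eps_low leaves more room to raise the
                        probability of unlikely tokens, which is the
                        mechanism DAPO uses against entropy collapse.
    """
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    per_token = np.minimum(ratio * advantages, clipped * advantages)
    # Token-level aggregation: average over all valid tokens in the
    # batch, so long responses contribute proportionally more tokens
    # instead of being down-weighted by a per-sample mean.
    return -(per_token * mask).sum() / np.maximum(mask.sum(), 1)
```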
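
A minimal sketch of the dynamic-sampling idea, under assumed interfaces: prompts whose sampled response group is uniformly correct or uniformly wrong yield zero advantage (and hence zero gradient), so they are filtered out before the training batch is formed. The `sample_group` callable, batch size, and group size here are hypothetical.

```python
def fill_batch_with_dynamic_sampling(sample_group, prompts,
                                     batch_size, group_size=8):
    """Keep only prompts whose sampled responses have mixed outcomes.

    sample_group(prompt, n) is assumed to return a list of n binary
    accuracy rewards for that prompt (hypothetical interface).
    """
    kept = []
    for prompt in prompts:
        rewards = sample_group(prompt, group_size)
        # All-correct or all-wrong groups give zero advantage under a
        # group-relative baseline, contributing no gradient signal.
        if 0 < sum(rewards) < group_size:
            kept.append((prompt, rewards))
        if len(kept) == batch_size:
            break
    return kept
```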
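
The overlong reward-shaping rule can be pictured as a soft length penalty applied inside a buffer zone before the hard generation limit, rather than assigning a noisy penalty to every truncated sample. The concrete lengths below are placeholders, not the paper's configuration.

```python
def overlong_penalty(response_len, max_len=16384, buffer_len=4096):
    """Soft penalty for responses approaching or hitting the length cap.

    Below max_len - buffer_len no penalty applies; inside the buffer the
    penalty grows linearly; at or beyond max_len the full penalty of -1
    replaces a hard, noisy truncation reward.
    """
    threshold = max_len - buffer_len
    if response_len <= threshold:
        return 0.0
    if response_len >= max_len:
        return -1.0
    return -(response_len - threshold) / buffer_len
```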